Atrous residual convolutional neural network based on U-Net for retinal vessel segmentation

Extracting features of retinal vessels from fundus images plays an essential role in the computer-aided diagnosis of diseases such as diabetes, hypertension, and cerebrovascular disease. Although a number of deep learning-based methods have been applied in this field, accurate retinal vessel segmentation remains challenging due to limited densely annotated data, inter-vessel differences, and structured prediction problems, especially in areas of small blood vessels and the optic disk. In this paper, we propose an atrous residual U-Net (ARU-Net) model with an atrous block to address these issues. The atrous block avoids the loss of structural information and enlarges the receptive field, so that each convolution output contains a larger range of information. In addition, we introduce residual convolution blocks to increase the network depth and improve network performance. Sensitivity (Se), specificity (Sp), F1-score (F1), accuracy (Acc), and area under the ROC curve (AUC) are used to measure the performance of the model. Experimental results on two benchmark datasets demonstrate the effectiveness of the proposed method, with accuracies of 0.9686 on DRIVE and 0.9746 on CHASE DB1. The resulting segmentations can assist doctors in diagnosis more effectively.


Introduction
Diseases such as diabetes, hypertension, and diseases of the retina manifest in retinal vascular images [1]. Analysis of the number, angle, branching, and curvature of retinal blood vessels can provide a valuable basis for clinical diagnosis [2] and thus for early prevention, diagnosis, and treatment. With the development and popularity of optical coherence tomography (OCT) imaging technology and the increasing emphasis on early disease diagnosis, the number of fundus images is increasing rapidly, and analyzing them requires much time and effort [3]. In addition, differences in image acquisition procedures between machines and institutions may lead to large differences in resolution, noise, and tissue appearance, which increase the difficulty of analysis [4].
To meet this need, fast and automatic segmentation methods for retinal vascular images have been developed [5,6]. They can improve segmentation efficiency and accuracy, reduce patient waiting time, and save medical resources [7,8]. They can also provide a basis for fundus image registration, arteriovenous classification, and biometric recognition [9]. Notably, deep learning methods have performed better than traditional methods [10][11][12][13].
Among the unsupervised methods, filtering, morphological transformation, and model-based algorithms are dominant [14]. The method was evaluated on the DRIVE and STARE databases and returned accuracies of 0.945 and 0.9486, respectively. Wavelet transforms were used by Akram et al. and Soares et al. for retinal vascular segmentation [15], achieving 94.4% accuracy on STARE. Their approach uses a 2D Gabor wavelet and Gaussian mixture models, and it is heavily influenced by the quality of the image. The entropy of some particular antennas with a pre-fractal shape, the harmonic Sierpinski gasket, and the Weierstrass-Mandelbrot fractal function were studied, and the results indicated that their entropy is linked with the fractal geometrical shape and physical performance [16,17]; accuracies of 94.3% and 94.4% were achieved on DRIVE and STARE. This kind of unsupervised technique is fast to compute, but its accuracy is limited. Frangi et al. proposed a multi-scale vessel enhancement filtering method to enhance vascular and vessel-like patterns, in which second-order local structural features were used [18]. Sato et al. applied three-dimensional (3D) multi-scale line filtering to the segmentation of cerebrovascular, bronchial, and liver vessels [19]. This method can improve the continuity of the vessel structure and reduce noise, but it is computationally expensive and slow. Jiang et al. proposed a universal vessel segmentation framework based on adaptive local thresholding and applied it to retinal vessel segmentation [20]. They observed a 65% true positive rate, but the method has more parameters and runs more slowly. Zhang et al. proposed a filter-based segmentation method for retinal vessels that uses a locally adaptive derivative filter [21], achieving 94.76% and 95.54% accuracy on DRIVE and STARE. Azzopardi et al. improved COSFIRE operator detection and applied it to retinal segmentation [22], achieving 94.27% and 94.11% accuracy on DRIVE and CHASE_DB1. Zhao et al. designed a new retinal vessel segmentation model using an infinite perimeter active contour model with hybrid region information [23], achieving 95.4% and 95.6% accuracy on DRIVE and STARE. Liang et al. proposed a level set method for vessel segmentation based on regional energy fitting information and a shape prior probability [24], achieving 95.03% and 95.36% accuracy on DRIVE and STARE. Unsupervised segmentation frameworks generally rely on filters that respond to both vessels and vessel-like structures, which leads to incomplete vessels and to vessel-like parts being misidentified as vessels. Moreover, parameter settings have a great influence on the final segmentation result.
For the supervised methods, ground truth must be used to train a classifier, which is then used to extract the blood vessels. The features of retinal blood vessels can be extracted by multiple methods [25,26]. Traditional machine learning methods train classifiers such as k-nearest neighbors, AdaBoost, and random forests [27], achieving 92.9% accuracy on DRIVE. Orlando et al. proposed a fully connected conditional random field model for retinal vascular segmentation, with model parameters learned by a structured output support vector machine [28], and achieved G-mean values of 0.8741 and 0.8628 on DRIVE and STARE. Zhang et al. performed retinal vascular segmentation using filtering and a wavelet transform strategy together with random forest training [29], achieving 94.66% and 95.47% accuracy on DRIVE and STARE. In the above methods, feature selection has a great influence on the final segmentation result, for example whether the features are independent or easy to identify. However, these features must be selected manually based on experience and experiment, so much work remains to overcome this shortcoming.
Benefiting from the rapid development of computer hardware, convolutional neural networks (CNNs) [30] have become the main machine learning method. Many CNN-based classification and detection methods have been proposed, which have facilitated the rapid development of medical assisted diagnosis. In retinal vascular segmentation, the proportion of vascular areas, especially capillaries, is relatively small. To achieve better segmentation, it is necessary to increase the amount of training data and the training time. However, the amount of training data in existing public datasets is limited. To address this problem, Ronneberger et al. proposed U-Net [31], which combines coarse and fine features through skip connections and can achieve better accuracy (95.34% and 95.78% on DRIVE and STARE) with less training data. Many methods based on U-Net have achieved good results, but problems remain, such as low accuracy, poor sensitivity, and errors in segmented regions, especially the loss of vessel branch points, intersection points, and small vessels. The results reported by other authors are summarized in Table 1.
We propose ARU-Net, a deep learning model that automatically segments retinal blood vessels in fundus images. The model leverages the strengths of U-Net, cascaded atrous convolution, and residual blocks enriched with squeeze-and-excitation. Residual blocks [32] are used as building units to simplify training and help extract coarse and fine features from source images. Squeeze-and-excitation units are added to each residual block for channel attention, adaptive feature recalibration, and increased feature representation power. The added dilated convolution module supports global and multi-scale feature extraction. We evaluated our model on the publicly available DRIVE [33] and CHASE DB1 [34] datasets, and the results show that it is effective and improves performance. The proposed approach makes the following contributions.
1. We propose a U-Net model integrating modified residual blocks to improve network performance.
2. An improved hybrid atrous convolution is used to increase the receptive field.
The rest of this paper is organized as follows: Section 2 reviews the relevant literature; Section 3 presents the proposed method; Section 4 analyzes and discusses the experimental results; Section 5 concludes this study.

Related work
Because deep learning frameworks perform well in retinal vascular segmentation, we review related work based on typical deep learning architectures.

U-Net
U-Net can be divided into three parts: left (down-sampling), middle (copy and crop), and right (up-sampling). The first part reduces the size of the image through four down-sampling operations, extracting features from shallow information. The copy and crop part includes four splicing (concatenation) operations, which fuse deep and shallow feature information. In the up-sampling part, the image is enlarged, and deep information is extracted through four up-sampling operations. During up-sampling, the number of channels is halved at each step, the reverse of the channel change during feature extraction in the left part [35]. The up-sampling process fuses the shallow information from the left part by concatenating the features. Skip connections link stages at the same level, ensuring that the recovered feature map integrates more low-level features and features of different scales. In this way, multi-scale prediction and deep supervision can be carried out, and details such as the edges of segmentation maps can be recovered more finely [36].
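The channel and resolution bookkeeping described above can be traced with a short sketch. The input size (512), the base channel count (64), and the function name `unet_shape_trace` are illustrative assumptions and are not taken from the text; the sketch only assumes that each down-sampling halves the spatial size and doubles the channels, and that each up-sampling does the reverse.

```python
def unet_shape_trace(size=512, channels=64, depth=4):
    """Trace (stage, spatial size, channels) through a U-Net-like encoder/decoder.

    Assumption: every down-sampling step halves the spatial size and doubles
    the channel count; every up-sampling step doubles the size and halves
    the channels, so each decoder stage matches its skip connection.
    """
    trace = [("input", size, channels)]
    skips = []  # encoder shapes saved for the copy-and-crop (skip) connections
    for level in range(depth):
        skips.append((size, channels))
        size, channels = size // 2, channels * 2
        trace.append((f"down{level + 1}", size, channels))
    for level in range(depth):
        size, channels = size * 2, channels // 2
        # The decoder stage must line up with the encoder stage it is fused with.
        if (size, channels) != skips.pop():
            raise ValueError("decoder stage does not match its skip connection")
        trace.append((f"up{level + 1}", size, channels))
    return trace


if __name__ == "__main__":
    for stage, s, c in unet_shape_trace():
        print(f"{stage:>6}: {s}x{s}, {c} channels")
```

The check inside the second loop mirrors why the skip connections can be concatenated at all: the shapes on both sides of the "U" agree stage by stage.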

ResNet
It has been found that deeper network layers and smaller receptive fields can improve neural network performance. However, as the network structure deepens, two problems arise. The first is that vanishing and exploding gradients affect the convergence of training. The second is degradation: as the number of layers increases, model accuracy decreases (which is not caused by overfitting), and both the training and testing errors increase. To overcome these problems, He et al. proposed the residual network [37], which shows significantly improved training characteristics and allows previously unachievable network depths.
In contrast to the traditional convolutional or fully connected layer, ResNet has many bypass branches that connect the input directly to the following layer, so as to directly learn the residuals. This structure is also known as a shortcut connection. Such a structure can directly detour the input information to the output to protect its integrity. The network only needs to learn the input and output differences of that part, simplifying the learning objectives and decreasing difficulty.
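The shortcut connection can be illustrated with a minimal sketch in which the block output is the input plus a learned residual, y = x + F(x). The helper names (`residual_block`, `zero_branch`, `small_correction`) are hypothetical, and plain lists stand in for feature maps; this is not the network used in the paper.

```python
def residual_block(x, residual_fn):
    """Return y = x + F(x) elementwise, where residual_fn plays the role of F."""
    fx = residual_fn(x)
    if len(fx) != len(x):
        raise ValueError("F(x) must have the same length as x")
    return [xi + fi for xi, fi in zip(x, fx)]


def zero_branch(x):
    """A residual branch that has learned nothing: F(x) = 0."""
    return [0.0 for _ in x]


def small_correction(x):
    """A residual branch that learns a small correction: F(x) = 0.1 * x."""
    return [0.1 * xi for xi in x]


if __name__ == "__main__":
    x = [1.0, -2.0, 3.0]
    print(residual_block(x, zero_branch))       # the input passes through unchanged
    print(residual_block(x, small_correction))  # the input plus a small learned difference
```

With `zero_branch` the block reduces to the identity, which is the sense in which the shortcut protects the input information: the layers only have to model the difference between input and output.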

Method
Using U-Net as the basic framework in medical image segmentation can address the problem of small sample sizes that is common for such images. However, U-Net is composed of a contraction path that gradually reduces the spatial dimension of the image through down-sampling, and an expansion path that gradually restores the details and spatial dimension of the object through up-sampling. As a result, repeated convolution and down-sampling can cause vanishing gradients, loss of structural information, and other problems.
In this study, we integrate a residual network (ResNet) and atrous convolution modules into U-Net to form a new network structure, the atrous residual U-Net (ARU-Net), which further expands the receptive field and improves the correlation between objects without losing information, thus improving vessel segmentation performance. The framework is shown in Fig 1. It consists of a training phase and a testing phase. In the training phase, color fundus images are preprocessed with grayscale transformation and normalization and then used as training data. The network adjusts its weight parameters by iterative learning, and the weights are then saved. In the testing phase, the network reloads the saved weights and makes predictions on the preprocessed data.
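The preprocessing step (grayscale transformation followed by normalization) might look like the sketch below. The text does not give the exact formulas, so the luminance weights (0.299, 0.587, 0.114) and the zero-mean, unit-variance normalization are assumptions, as are the function names `to_grayscale` and `normalize`.

```python
import math


def to_grayscale(rgb_image):
    """Convert rows of (R, G, B) tuples to rows of luminance values.

    Assumption: standard luminance weights; the paper does not state which
    grayscale transformation it uses.
    """
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row]
            for row in rgb_image]


def normalize(gray_image, eps=1e-8):
    """Shift and scale the image to zero mean and unit variance (assumed scheme)."""
    pixels = [p for row in gray_image for p in row]
    mean = sum(pixels) / len(pixels)
    variance = sum((p - mean) ** 2 for p in pixels) / len(pixels)
    std = math.sqrt(variance)
    return [[(p - mean) / (std + eps) for p in row] for row in gray_image]


if __name__ == "__main__":
    image = [[(255, 255, 255), (0, 0, 0)],
             [(255, 0, 0), (0, 0, 255)]]
    print(normalize(to_grayscale(image)))
```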

Modified residual block
To further improve the performance of the network, we include a squeeze-and-excitation block in ResNet [38]; the differences from the original residual network are shown in Fig 2B and 2C. The squeeze operation (red box, Fig 2(C)) reduces the spatial dimension of each input feature map from H×W to 1×1, which we achieve with global average pooling. In other words, feature compression is performed along the spatial dimension, turning each two-dimensional feature channel into a real number with a global receptive field, so the dimension of the output matches the number of input feature channels. The result represents the global distribution of responses over the feature channels and enables layers close to the input to obtain a global receptive field, which is useful in many tasks.
$$z_c = F_{sq}(u_c) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} u_c(i, j)$$
where z_c is the channel descriptor for channel c, F_sq is global average pooling, u_c is channel c of the input, and H and W are the height and width of the input. In excitation (green box, Fig 2(C)), the feature dimension is first reduced to 1/r of the input and, after ReLU activation, raised back to the original dimension through a fully connected layer. Compared with directly using a single fully connected layer, this design is more nonlinear, can better fit the complex correlations between channels, and greatly reduces the number of parameters and the amount of computation.
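A minimal sketch of the squeeze and excitation steps follows, using nested lists as a C×H×W feature map. The sigmoid gate and the final channel-wise rescaling follow the standard squeeze-and-excitation formulation and are assumptions here, since the text only describes the pooling, the reduction to 1/r, the ReLU, and the expansion. All function names and weight values are hypothetical.

```python
import math


def squeeze(u):
    """Global average pooling: one scalar z_c per channel of a C x H x W input."""
    return [sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
            for channel in u]


def dense(weights, vector):
    """Bias-free fully connected layer: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def excite(z, w_reduce, w_expand):
    """Reduce to C/r units, apply ReLU, expand back to C, then gate each channel."""
    hidden = [max(0.0, h) for h in dense(w_reduce, z)]           # ReLU on C/r units
    logits = dense(w_expand, hidden)                              # back to C units
    return [1.0 / (1.0 + math.exp(-s)) for s in logits]           # sigmoid gate (assumed)


def se_block(u, w_reduce, w_expand):
    """Rescale every channel of u by its excitation weight (assumed final step)."""
    scales = excite(squeeze(u), w_reduce, w_expand)
    return [[[s * v for v in row] for row in channel]
            for channel, s in zip(u, scales)]


if __name__ == "__main__":
    u = [[[1.0, 3.0], [5.0, 7.0]],      # channel 0, mean 4.0
         [[0.0, 0.0], [0.0, 0.0]]]      # channel 1, mean 0.0
    w_reduce = [[0.5, 0.5]]             # C = 2 reduced to C/r = 1 (r = 2)
    w_expand = [[1.0], [-1.0]]          # expanded back to C = 2
    print(squeeze(u))
    print(se_block(u, w_reduce, w_expand))
```

The two small weight matrices hold 2·C·(C/r) values, compared with C·C for a single fully connected layer between channels, which is where the parameter saving mentioned above comes from.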

Hybrid atrous convolution block
As is well known, atrous convolution can enlarge the receptive field [39,40]. When the convolution kernel is 3×3, 1-dilated and 2-dilated together can achieve the effect of a 7×7 convolution kernel. Similarly, when 4-dilated conv is followed by 1-dilated and 2-dilated conv, the receptive field can achieve the effect of a 15×15 convolution kernel. Compared with traditional convolution operations, the receptive field of atrous convolution grows exponentially.
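The receptive-field figures quoted above can be checked with a few lines of arithmetic. For stride-1 convolutions, each layer with kernel size k and dilation rate r adds (k − 1)·r to the receptive field; the function name below is hypothetical and the stride-1 setting is an assumption.

```python
def stacked_receptive_field(kernel_size, dilation_rates):
    """Receptive field of stacked stride-1 dilated convolutions.

    Each layer with dilation rate r widens the field by (kernel_size - 1) * r.
    """
    field = 1
    for rate in dilation_rates:
        field += (kernel_size - 1) * rate
    return field


if __name__ == "__main__":
    print(stacked_receptive_field(3, [1, 2]))     # 1-dilated then 2-dilated
    print(stacked_receptive_field(3, [1, 2, 4]))  # followed by a 4-dilated layer
    print(stacked_receptive_field(3, [1, 1, 1]))  # three ordinary 3x3 layers
```

Two dilated layers reach the 7×7 field that three ordinary 3×3 layers need, and doubling the rate at each layer roughly doubles the field each time, while undilated layers grow it only linearly.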
However, when convolutions with the same dilation rate are simply stacked many times, the kernel is not continuous, i.e., not all pixels are used in the calculation. The sampled information therefore forms a checkerboard pattern, which loses continuity and does not work well for small objects.
We use a hybrid atrous convolution block to solve this problem, where r_i is the dilation rate of the i-th layer and M_i is the maximum dilation rate at layer i. Assuming there are n layers, M_n = r_n by default. If we apply a k×k convolution kernel, the goal is M_2 ≤ k, so that all the holes can be covered as with a standard convolution. The proposed U-Net block is shown in Fig 2A. We replaced the original 3×3 convolution block in U-Net with the AR Conv Unit. The red arrow represents the AR Conv Unit, which is also an important difference from U-Net; the specific algorithm flow is given in Algorithm 1.
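The condition on M_2 can be checked numerically. The sketch below assumes the recurrence commonly used for hybrid dilated convolution, M_i = max(M_{i+1} − 2·r_i, M_{i+1} − 2·(M_{i+1} − r_i), r_i) with M_n = r_n; this recurrence is not written out in the text, so it should be read as an assumption, and the function names are hypothetical. The dilation (void) rates in the examples are illustrative only.

```python
def max_gap_per_layer(rates):
    """Compute M_i for every layer, working backwards from M_n = r_n.

    Assumed recurrence (standard hybrid dilated convolution formulation):
        M_i = max(M_{i+1} - 2*r_i, M_{i+1} - 2*(M_{i+1} - r_i), r_i)
    """
    if not rates:
        raise ValueError("at least one dilation rate is required")
    m = [0] * len(rates)
    m[-1] = rates[-1]
    for i in range(len(rates) - 2, -1, -1):
        nxt = m[i + 1]
        m[i] = max(nxt - 2 * rates[i], nxt - 2 * (nxt - rates[i]), rates[i])
    return m


def satisfies_hdc(rates, kernel_size):
    """Check the design goal M_2 <= k for a stack of k x k dilated convolutions."""
    if len(rates) < 2:
        raise ValueError("the M_2 condition needs at least two layers")
    return max_gap_per_layer(rates)[1] <= kernel_size  # index 1 is layer 2


if __name__ == "__main__":
    for rates in ([1, 2, 4], [1, 2, 5], [1, 2, 9], [2, 4, 8]):
        print(rates, max_gap_per_layer(rates), satisfies_hdc(rates, 3))
```

Under this assumed recurrence, rates such as [1, 2, 4] leave no uncovered holes for a 3×3 kernel, whereas [1, 2, 9] and [2, 4, 8] do not meet the condition.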

Algorithm 1: Algorithm of the proposed ARU convolution unit
Input: Feature map X
Output: Feature map Y

The experiments were conducted on a desktop computer with an Intel Core i5-9400 CPU, 32 GB of RAM, and an NVIDIA 1080Ti GPU with 11 GB of memory. The Adam optimization method was used to optimize the parameters. Since phased training can speed up the convergence of the network, we adopted different parameters during training [41]. On the DRIVE dataset, we found that using a learning rate of 0.003 for the first 200 epochs, followed by a learning rate of 0.0001, achieves good results. The same holds for CHASE DB1. The difference is that we set the batch size to 2 and 1, respectively. The relevant model dimensions are listed in Table 2.
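The phased learning-rate schedule described above can be written as a small step function. The zero-based epoch index and the function name `learning_rate` are assumptions; only the two rates and the 200-epoch switch point come from the text.

```python
def learning_rate(epoch, switch_epoch=200, initial_lr=0.003, final_lr=0.0001):
    """Two-phase schedule: initial_lr for the first switch_epoch epochs, then final_lr.

    Assumption: epochs are counted from 0, so epochs 0..199 use the initial rate.
    """
    if epoch < 0:
        raise ValueError("epoch must be non-negative")
    return initial_lr if epoch < switch_epoch else final_lr


if __name__ == "__main__":
    for epoch in (0, 199, 200, 250):
        print(epoch, learning_rate(epoch))
```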

Evaluation metrics
To quantitatively evaluate our model, we compared the segmentation results with the corresponding ground truth; classified each pixel as a true positive (TP), false positive (FP), false negative (FN), or true negative (TN); and adopted sensitivity (Se), specificity (Sp), F1-score (F1), and accuracy (Acc) to evaluate the performance of the model. We also use the area under the ROC curve (AUC), where a value of 1 indicates perfect segmentation.
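The pixel-wise metrics can be computed from the four counts using their standard definitions: Se = TP/(TP+FN), Sp = TN/(TN+FP), Acc = (TP+TN)/(TP+FP+FN+TN), and F1 as the harmonic mean of precision and sensitivity. The sketch below uses flat lists of binary labels; the function names are hypothetical, and it assumes both vessel and background pixels occur so that no denominator is zero.

```python
def confusion_counts(prediction, truth):
    """Count TP, FP, FN, TN over two equally long sequences of 0/1 pixel labels."""
    if len(prediction) != len(truth):
        raise ValueError("prediction and truth must have the same length")
    tp = fp = fn = tn = 0
    for p, t in zip(prediction, truth):
        if p and t:
            tp += 1
        elif p and not t:
            fp += 1
        elif not p and t:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def segmentation_metrics(tp, fp, fn, tn):
    """Sensitivity, specificity, accuracy, and F1 from the confusion counts.

    Assumption: every denominator is non-zero (both classes are present
    and at least one pixel is predicted as vessel).
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {"Se": sensitivity, "Sp": specificity, "Acc": accuracy, "F1": f1}


if __name__ == "__main__":
    predicted = [1, 1, 0, 0, 1, 0]
    reference = [1, 0, 0, 1, 1, 0]
    counts = confusion_counts(predicted, reference)
    print(counts)
    print(segmentation_metrics(*counts))
```

AUC is omitted from the sketch because it requires the continuous probability map rather than the thresholded labels.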

Results
Table 3 shows our results on the public DRIVE and CHASE DB1 datasets. We compare residual blocks alone, atrous convolution alone, and our full approach. Adding either the residual convolution block or the hybrid atrous convolution block on its own improves the segmentation performance of the network, and adding both gives better results still. Fig 3 shows some examples from our experiments. On both datasets, our segmentation results are very close to the gold standard, especially for some small blood vessels. We also verify the influence of different blocks on the results. According to the results in Table 1, our method is superior to the previous best in some key metrics: it achieves the highest ACC (0.9686/0.9746), the highest AUC (0.9842/0.9869), and the highest Se (0.8149/0.8420). This indicates that the residual convolution block and the hybrid atrous convolution block are both useful for the network.
The proposed method was also compared with U-Net in a segmentation experiment. We can conclude from these results that our method segments more vascular details, especially capillaries (marked by a red box).
Finally, we compared ARU-Net with several state-of-the-art methods on the DRIVE and CHASE DB1 datasets. The results, shown in Table 4, indicate that ARU-Net performs best on both datasets as measured by Se, Sp, ACC, F1, and AUC. In detail, on DRIVE and CHASE DB1, our model has the highest AUC (0.21%/0.09% higher than the previous best), the highest accuracy (1.08%/0.85% higher than the previous best), and the highest sensitivity, while F1 and specificity are generally comparable. Hence, our method achieves state-of-the-art performance for retinal vessel segmentation. In addition, adding a multi-scale strategy can improve network performance, as on the DRIVE dataset, where Se, ACC, and AUC were improved (to 0.7772, 0.9553, and 0.9759, respectively) [55,56]. That study applies an improved cross-entropy loss function and uses CRFs as a post-processing strategy. The challenge is how to exploit the relationships between images at different scales. The segmentation result is greatly affected by the weight coefficient, which needs to be set manually, and compared with some other methods, the performance improvement is not obvious. For the convolutional neural network with a reinforcement sample learning strategy proposed by Guo et al. [57], the Sp and ACC values were the lowest, and the final segmentation result was the worst. For U-Net with a patch-based learning strategy, the Se, Sp, ACC, and AUC values were not the highest; however, the segmentation result was the best in the comprehensive evaluation.

Discussion and conclusion
In medical image segmentation, common data augmentation methods include random slicing, rotation, and mirroring [58,59]. In general, these significantly increase the accuracy of the model on the training set, but accuracy on the validation set does not improve significantly; that is, the generalization ability of the model is not substantially improved. To improve the overall performance of the network as measured by Se, Sp, ACC, and AUC, it is necessary to adjust the network structure and change the supervision function and the optimizer. Unlike other data augmentation approaches, our method operates on the entire image. This has proven beneficial, as our model is faster than the above methods. In addition, our model obtains more natural and continuous segmentation masks and captures more detailed features. Furthermore, the introduction of hybrid atrous convolution blocks and modified residual blocks in our framework made it possible to use images at multiple scales, with corresponding supervision at each scale, helping our model to efficiently aggregate the outputs of different stages.
We presented ARU-Net, a segmentation network to which atrous and residual convolution units were added. The atrous unit enlarges the receptive field without losing resolution. We replaced ReLU with LeakyReLU in the down-sampling path. We evaluated the method on the DRIVE and CHASE DB1 benchmark datasets, and the accuracy and sensitivity metrics demonstrated that our model segments fundus vessels better than other models. The branches of many vessels, including very small vessels, were correctly segmented. Fundus diseases are often reflected in small changes in the shape of blood vessels, so the proposed method can help doctors diagnose diseases. However, breaks in the segmented vessels can still occur in images with lesions. In future work, we will continue to explore how to reduce the problem of broken vessels, so that the segmentation results are closer to the ground truth. The experimental results on the DRIVE and CHASE DB1 datasets are shown in Figs 6 and 7.